Mylly - The Mill: A New Platform for Processing Speech and Text Corpora Easily and Efficiently

Authors

  • Mietta Lennes
  • Jussi Piitulainen
  • Martin Matthiesen
Abstract

Speech and language researchers need to manage and analyze increasing quantities of material. Tools are available for the various stages of the work, but they often require the researcher to use different interfaces and to convert the output from each tool into suitable input for the next one. The Language Bank of Finland (Kielipankki) is developing an on-line platform called Mylly for processing speech and language data in a graphical user interface that integrates different tools into a single workflow. Mylly provides tools and computational resources for processing material and for inspecting the results. The tools plugged into Mylly include a parser, morphological analyzers, generic finite-state technology, and a speech recognizer. Users can upload data and download any intermediate results in the tool chain. Mylly runs on CSC’s Taito cluster and is an instance of the Chipster platform. Access rights to Mylly are given for academic use. The Language Bank of Finland is a collection of corpora, tools and other services maintained by FIN-CLARIN, a consortium of Finnish universities and research organizations coordinated by the University of Helsinki. The technological infrastructure for the Language Bank of Finland is provided by CSC – IT Center for Science.

Similar resources

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

A new model for Persian multi-part words edition based on statistical machine translation

Multi-part words in the English language are hyphenated, with the hyphen separating the different parts. The Persian language has multi-part words as well. Based on Persian morphology, the half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use the space character instead of the half-space character. This common incorrect use of the space leads to some s...

A'lam Corpus: A Standard Named Entity Corpus for the Persian Language

Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used for text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. A NER system is tasked with determining the border of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...

A New Annotation Tool for Aligned Bilingual Corpora

This paper presents a new annotation tool for aligned bilingual corpora, which allows the annotation of a wide range of information, ranging from information about words (such as part-of-speech tags or named-entities) to quite complex annotation schemas involving links between aligned segments, such as co-reference or translation equivalence between aligned segments in the two languages. The an...

Spoken Content-Based Audio Navigation (SCAN)

We describe SCAN, a system for retrieving and browsing speech documents from large audio corpora that uses new information retrieval and speech processing techniques to create easily navigable presentations of documents relevant to a user query. Experiments show that the new interface is more effective than simple speech-alone interfaces.

Publication date: 2017